Management Advisory 05-01
Page 2 of 10

constraints have limited its support of the legacy servers beyond routine maintenance and
operations. For this reason, OCIO has encouraged its customer community to accelerate
the upgrading of their publicly accessible websites so that they can be moved to the newer
servers. The customer community reports that it lacks the resources to do so. Because
resource constraints will make legacy servers a necessary component of OCIO operations
for the near term, we recommend that OCIO work with its customer community to
identify a strategy for timely backups of legacy servers; inform customers how website
expansions affect backups; clarify when or if it will add spare drives to servers under its
control; clarify how it will test the restore process; and revise the timeframe for testing the
recovery plan for one of the clustered web servers. Further details on our findings are
provided below.

Nature and Impact of the Server Crash

Our review disclosed that the Web4 server crashed when two of its four hard drives failed
in succession. The server was an older model that hosted some of the Institution’s public
websites and applications that were not in compliance with the new standards developed
by OCIO.1 The Web4 server was configured with built-in redundancy so that if one of the
four hard drives failed, the data would migrate and be shared with the remaining drives
until the failed drive could be replaced.2 If a second drive failed, however, the system
could not automatically rebuild itself and the data could not be shared between the two
remaining drives.

When the server failed, an audio alarm sounded, alerting OCIO that there had been a
serious malfunction in the secured room that housed the server. The OCIO staff member
who examined the Web4 server in response to the alarm saw that the hard drive failure
indicator light was on.3
He went to his office to retrieve a replacement hard drive and on
his return to the secured room saw that a second drive had failed.

The Web4 server was sent to a private company to attempt a recovery of the data. The
company noted that in attempting to rebuild the system, OCIO staff had overwritten data
on one of the remaining drives, inadvertently wiping out whatever data had been
recorded. Although the exact cause of the drive failures could not be determined4, the age
of the server and its heavy use may have been contributing factors. The server was eight
months beyond its three-year warranty period. OCIO officials told us that because they
operate on a four-year replacement cycle, they accept a planned risk on expired
warranties. We noted that 8 of the 31 servers managed by OCIO’s Web Server Division
are out of warranty.

1  These standards include Smithsonian Directive 920, Life Cycle Management, and OCIO’s Technical
   Reference Model, IT-940-01.
2  Web4 is a RAID5 server. RAID stands for “redundant array of independent disks.” It is a way of storing
   the same data in different places on multiple hard drives.
3  OCIO staff in the Operations Center would check the indicator lights on all of the servers in the morning.
   The alarm indicating a failure sounded before the Operations Center staff conducted its check.
4  The engineers at the company could not determine the cause of the failure.

Had the customer community whose operations were hosted on the
Web4 server been aware that the server was out of warranty, they might have been better
prepared to address the risk of the server failure with OCIO.

When the server failed, over 30 websites5 were temporarily lost for two to six days.6 These
sites included public web pages and services, such as the Archives of American Art and the
HistoryWired site of the National Museum of American History, and Institution intranet
sites, such as the National Museum of the American Indian’s intranet and the Woodrow
Wilson International Center’s site.

OCIO was able to restore all of the downed websites by February 15, 2005, from backup
copies maintained by OCIO staff, except for the Smithsonian Online Academic
Appointments (SOLAA) database.7 SOLAA, which provides mission-critical support to
NMNH’s Research Training Program, had not been backed up for approximately 18
months. Program staff ultimately located a December 2004 copy of the database from a
contractor, but it was insufficient to allow the program to proceed. As a result, NMNH
cancelled the 2005 Research Training Program and the planned fundraising efforts to
celebrate the program’s 25th anniversary. NMNH officials told us the 2006 program may
also be in jeopardy if the SOLAA application is not recreated by the start of the program
year.

Given that the Web4 and other legacy servers like it may have reliability problems, it is
critical that adequate backups be made so that data is not irretrievably lost when a server
fails. However, OCIO, which is responsible for backing up the servers over which it
exercises control, mistakenly had not backed up the SOLAA database. OCIO officials also
stated that they lacked the resources to make the backups without temporarily removing
the system from production.
To make the backups, data on the Web4 server would have
to be sent to OCIO’s server, which resides behind a firewall,8 where the data is copied and
retransmitted to the Web4 server. This would have required that the SOLAA site be
taken down for a short period of time. We found that OCIO performs this type of
backup for other systems it operates. While there is a temporary disruption of service,
OCIO posts notices to announce the backup schedule to alert users that the system will be
unavailable.

5  The larger sites that were hosted on the Web4 server included Affiliations, SOLAA, Archives of American
   Art, HistoryWired, Smithsonian Institution Libraries, Smithsonian Press’s Smithsonian Legacies, and the
   Woodrow Wilson International Center.
6  All sites but one were returned to service by February 15, 2005.
7  Data from the National Museum of the American Indian was also irretrievably lost, but it was not deemed
   mission-critical.
8  A firewall is a system designed to prevent unauthorized access to or from a private network such as an
   intranet.

Moreover, on this and other legacy servers, customers were allowed to expand their
individual websites with new data and features, which resulted in more data needing to be
backed up. OCIO committed to its customer community on its website that it would be
responsible for maintaining the equipment and backing up data under its control.
However, OCIO staff told us that they had been experiencing problems meeting the
backup schedules requested by its customer community. These schedules largely fell
within non-production hours (8:00 p.m.
to 8:00 a.m.).

Preventative Measures Planned by OCIO

In its root cause analysis report, OCIO outlined a number of measures it plans to take to
prevent or otherwise mitigate future service disruptions and data losses on legacy servers.
Since issuing the report, OCIO has taken many positive steps to implement these
measures. For example, OCIO adopted new policies in March 2005 requiring that: (1)
no data be excluded from standard backups of OCIO-maintained servers; (2) any failed
hard drives removed from servers be kept pristine until data recovery has occurred; and
(3) when a data loss occurs, all of the server’s hard drives be sent for recovery within one
day.

Moreover, OCIO is making a strong effort to communicate with and involve the
customer community in preventing and mitigating such problems in the future. OCIO is
now posting weekly backup status reports on the Institution’s intranet (Prism) and has
updated its “frequently asked questions” page on Prism to include recommendations on
how to protect certain production systems. OCIO has also posted on Prism a contact list
and other key information for servers that it maintains and will record and track future
prevention actions through completion.

However, a number of other measures described in its report have not been or cannot be
implemented, primarily because of resource constraints. For example, the report stated
that OCIO will:

      •   Reconfigure all RAID5 servers under its control by April 1, 2005, to add a fifth
          drive as a “hot spare” to allow automatic rebuilding of drives in the event of two
          drives failing. OCIO officials told us that they presently do not have the resources
          to accomplish this because it would require a rebuild of each server down to the
          operating system level.
          Moreover, such a task would direct their limited resources
          to the legacy servers rather than to the newer technologies and the newer systems
          they are developing.

      •   Test the restore process9 at the request of owners of systems and applications, and
          then have owners test and verify the accuracy of the restoration. OCIO staff noted
          that such a process would require exactly mirroring one system onto another
          system, and OCIO lacks the legacy hardware needed to perform the restoration --
          hardware that is becoming obsolete and, therefore, is not worth purchasing.
          Moreover, members of the customer community we interviewed were not sure
          that if they made such a request it would be acted upon, or whether OCIO could
          meet the demand if several customers simultaneously requested such service. The
          customer community expressed these concerns at the draft stage of OCIO’s report,
          but the final report does not address them.

      •   Test the recovery plan10 for one of the clustered web servers in April 2005. As of
          May 9, 2005, OCIO had not done so, although the customer community told us
          that such an exercise was important to regain their confidence.

9  To restore is to copy backup files from secondary storage to hard disk to return data to its original
   condition if data has been damaged or to copy or move data to a new location.
10 A recovery plan consists of the precautions taken so that the effects of a disaster (e.g.,
   loss of computers
   and data) will be minimized, and the organization will be able to either maintain or quickly resume
   mission-critical functions.

Conclusions and Recommendations

A substantial number of legacy servers managed by OCIO are operating on expired
warranties and are likely to face the same vulnerabilities as the Web4 server. OCIO has
accepted server failure as an operating risk because resource constraints have limited its
support of the legacy servers beyond routine maintenance and operations.

Given that additional failures of the legacy servers are likely, it is imperative that OCIO
perform timely backups of customer data and applications so that they can be restored in
the event of a server failure. While OCIO has identified preventative measures it will take
to mitigate future service disruptions and data losses, it has not addressed how it will
overcome current delays in performing timely backups of customer data on servers that it
manages. It will also need to address with customers how expansion of the customers’
individual websites will affect OCIO’s ability to meet data backup requirements and
whether controls should be imposed on such expansions.

Further, other measures aimed at providing additional drives and testing the restore and
recovery process may not be implemented as promised or may transfer responsibility to
the customers for services that OCIO should provide. Because legacy servers will remain
an essential component of OCIO operations for the foreseeable future, OCIO will need to
provide its customer community assurances that any lost data or applications on the
legacy servers can be adequately recovered. To provide these assurances, we recommend
that OCIO:

   1. Develop a plan, in coordination with its customers, that describes how it will
      ensure that timely backups on OCIO-maintained servers are performed.

   2.
      Inform customers how further website expansions on the legacy servers will
      affect scheduled backups and what controls should be exercised over such
      expansions.

   3. Clarify whether RAID5 servers under its control will be reconfigured to add spare
      drives and, if so, develop a timeframe for completing such actions.

   4. Clarify how it would test the restore process for customer applications and
      systems given that it lacks the legacy hardware required for such tests.

   5. Provide a revised timeframe for testing the recovery plan for the clustered web
      server it reported it would test in April 2005.

Management Comments and Office of Inspector General Response

We discussed this report with OCIO officials, and their written comments (attached to
this report) have been incorporated, as appropriate. OCIO concurred with the report’s
findings, conclusions, and recommendations and identified corrective actions to prevent
or otherwise mitigate future service disruptions and data losses on legacy servers. By June
30, 2005, OCIO will develop a plan for web server infrastructure operations that will
address timely data backups and the reconfiguration of RAID5 web servers to use hot
spares.
At an upcoming monthly Webmasters meeting, OCIO will also discuss its plans
for an automated backup solution, the resource requirements associated with website
expansions, and a test of the restore process for customer applications and systems.
Finally, by January 2006, after changes to enhance the redundancy of the web server
infrastructure have been completed, OCIO will test the recovery plan for the clustered
web server that was originally to be tested in April 2005.

OCIO’s proposed actions are responsive to our recommendations and, once
implemented, should address the issues raised in this report.

Management Response